#read data
rm(list=ls())
setwd("~/Documents/UCI/Customer and Social Analytics/Midterm")
highnode = read.csv("HighNote Data Midterm.csv")
library(psych)
library(dplyr)
options(digits = 5)
options(scipen=3)
#Overall summary statistics
describe(highnode[-1])
#Summary statistics for adopter
describe(filter(highnode, adopter == 1)[-1])
#Summary statistics for non-adopter
describe(filter(highnode, adopter == 0)[-1])
#Compare means between the adopter and non-adopter subsamples
for (i in 1:15) {
  #skip the adopter column itself
  if (i == 13) next
  print(names(highnode[i + 1]))
  print(t.test(highnode[, i + 1] ~ highnode$adopter))
}
## [1] "age"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -17, df = 4080, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.2658 -1.7971
## sample estimates:
## mean in group 0 mean in group 1
## 23.948 25.980
##
## [1] "male"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -13.7, df = 4300, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.122787 -0.091954
## sample estimates:
## mean in group 0 mean in group 1
## 0.62186 0.72923
##
## [1] "friend_cnt"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -10.6, df = 3680, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -25.154 -17.330
## sample estimates:
## mean in group 0 mean in group 1
## 18.492 39.734
##
## [1] "avg_friend_age"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -15.7, df = 4140, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.6089 -1.2509
## sample estimates:
## mean in group 0 mean in group 1
## 24.011 25.441
##
## [1] "avg_friend_male"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -4.44, df = 4590, p-value = 0.0000091
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02884 -0.01118
## sample estimates:
## mean in group 0 mean in group 1
## 0.61659 0.63660
##
## [1] "friend_country_cnt"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -21.3, df = 3790, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.5288 -2.9331
## sample estimates:
## mean in group 0 mean in group 1
## 3.9579 7.1888
##
## [1] "subscriber_friend_cnt"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -12.3, df = 3630, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.4139 -1.0248
## sample estimates:
## mean in group 0 mean in group 1
## 0.41747 1.63680
##
## [1] "songsListened"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -21.6, df = 3790, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -17634 -14703
## sample estimates:
## mean in group 0 mean in group 1
## 17589 33758
##
## [1] "lovedTracks"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -21.2, df = 3710, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -193.94 -161.09
## sample estimates:
## mean in group 0 mean in group 1
## 86.823 264.341
##
## [1] "posts"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -4.22, df = 3660, p-value = 0.000026
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -23.3067 -8.5082
## sample estimates:
## mean in group 0 mean in group 1
## 5.293 21.200
##
## [1] "playlists"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -8.08, df = 3630, p-value = 8.6e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.43676 -0.26621
## sample estimates:
## mean in group 0 mean in group 1
## 0.54928 0.90077
##
## [1] "shouts"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -3.57, df = 3540, p-value = 0.00037
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -107.662 -31.272
## sample estimates:
## mean in group 0 mean in group 1
## 29.973 99.440
##
## [1] "tenure"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = -5.04, df = 4150, p-value = 0.00000048
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4626 -1.0840
## sample estimates:
## mean in group 0 mean in group 1
## 43.810 45.583
##
## [1] "good_country"
##
## Welch Two Sample t-test
##
## data: highnode[, i + 1] by highnode$adopter
## t = 8.8, df = 4250, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.054636 0.085954
## sample estimates:
## mean in group 0 mean in group 1
## 0.35779 0.28750
After comparing the adopter and non-adopter subsamples, we can conclude that the adopter group has the following attributes:
- Demographics: about 2 years older, more likely male (roughly 10 percentage points higher), and less likely to be from the US, UK, or Germany (about 7 percentage points lower).
- Friends: about twice as many friends, friends who are about a year older on average and come from about twice as many countries, and about four times as many subscriber friends.
- Engagement: about 2 more months on the site, about twice as many songs listened, three times as many loved tracks, four times as many posts, twice as many playlists, and three times as many shouts.
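The "X times" comparisons above come from ratios of group means. A minimal sketch of that computation, using a small hypothetical data frame in place of the real HighNote columns:

```r
# Sketch: ratio of adopter to non-adopter group means.
# `toy` is hypothetical data standing in for the HighNote columns.
library(dplyr)

toy <- data.frame(
  adopter    = c(0, 0, 0, 1, 1),
  friend_cnt = c(10, 20, 30, 40, 60),
  posts      = c(2, 4, 6, 20, 28)
)

group_means <- toy %>%
  group_by(adopter) %>%
  summarise(across(c(friend_cnt, posts), mean))

# Adopter means divided by non-adopter means
ratios <- group_means[group_means$adopter == 1, -1] /
  group_means[group_means$adopter == 0, -1]
print(ratios)
```

Running the same pattern over all numeric columns of `highnode` reproduces the multiples quoted above.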
The age distributions of non-adopters and adopters are very similar, with the average adopter slightly older than the average non-adopter. The proportion of males is also higher in the adopter group. From a demographic perspective, the typical adopter is therefore an older male.
library(ggplot2)
library(gridExtra)
highnode$adopter = factor(highnode$adopter, labels = c("non-adopter","adopter"))
#age histogram
g1 = ggplot(highnode, aes(x = age)) +
  geom_histogram(aes(fill = adopter), binwidth = 1) +
  labs(title = "Age Histogram",
       subtitle = "bin_width = 1",
       fill = "",
       caption = "Source: HighNote")
#age density
g2 = ggplot(highnode, aes(x = age)) +
  geom_density(aes(fill = adopter), alpha = 0.6) +
  labs(title = "Age Density",
       fill = "",
       caption = "Source: HighNote")
#gender bar
highnode$male = factor(highnode$male, labels = c("female", "male"))
g3 = ggplot(filter(highnode, adopter == "adopter"), aes(adopter)) +
  geom_bar(aes(fill = male), width = 0.3) +
  labs(title = "Gender in adopters",
       fill = "",
       caption = "Source: HighNote") +
  xlab(" ")
g4 = ggplot(filter(highnode, adopter == "non-adopter"), aes(adopter)) +
  geom_bar(aes(fill = male), width = 0.3) +
  labs(title = "Gender in non-adopters",
       fill = "",
       caption = "Source: HighNote") +
  xlab(" ")
#plot
grid.arrange(g1, g2, g3, g4,
             top = "Demographics")
If the friend subscription rate is defined as subscriber_friend_cnt/friend_cnt, the rate for the adopter group is more than twice that of the non-adopter group, so a friend of an adopter is more likely to be a subscriber. The relationship between the number of friends and the number of subscriber friends is also close to linear: there is no significant flattening as the number of friends grows. As the summary statistics showed, adopters have more friends. In short, peer influence does exist. In addition, the t-tests above show that the adopter group's friends are older on average, come from more countries, and are more likely to be subscribers.
#Friend_cnt vs Subscriber_friend_cnt
g5 = ggplot(highnode, aes(x = friend_cnt, y = subscriber_friend_cnt)) +
  geom_point(aes(color = adopter), size = 1, alpha = 0.8) +
  geom_smooth(aes(color = adopter), se = FALSE) +
  coord_cartesian(xlim = c(0, 2000), ylim = c(0, 75)) +
  labs(title = "Friend_cnt vs Subscriber_friend_cnt",
       color = " ",
       caption = "Source: HighNote")
#subscription rate
highnode = mutate(highnode, subscription_rate = subscriber_friend_cnt/friend_cnt)
adopter_subscription_rate = mean(highnode$subscription_rate[highnode$adopter == "adopter"])
nonadopter_subscription_rate = mean(highnode$subscription_rate[highnode$adopter == "non-adopter"])
g6 = ggplot(highnode, aes(x = ID, y = subscription_rate)) +
  geom_point(aes(color = adopter), size = 0.7, alpha = 0.8) +
  labs(title = "Subscription rate of friends",
       color = " ",
       y = "Rate",
       caption = "Source: HighNote")
g7 = ggplot(highnode, aes(x = adopter, y = subscription_rate)) +
  geom_bar(stat = "summary", fun = "mean", width = 0.3, fill = "seagreen")
highnode$subscription_rate = NULL
#plot
grid.arrange(g6, g7, g5,
             layout_matrix = rbind(c(1, 1),
                                   c(2, 3)),
             top = "Peer influence")
In every aspect of user engagement, the adopter group scores higher than the non-adopter group. It is plausible that the more a user engages with the site, the more likely he/she is to become an adopter.
#songsListened
g8 = ggplot(highnode, aes(x = adopter, y = songsListened)) +
  geom_bar(stat = "summary", fun = "mean", width = 0.3) +
  labs(x = " ")
#lovedTracks
g9 = ggplot(highnode, aes(x = adopter, y = lovedTracks)) +
  geom_bar(stat = "summary", fun = "mean", width = 0.3) +
  labs(x = " ")
#posts
g10 = ggplot(highnode, aes(x = adopter, y = posts)) +
  geom_bar(stat = "summary", fun = "mean", width = 0.3) +
  labs(x = "Group")
#playlists
g11 = ggplot(highnode, aes(x = adopter, y = playlists)) +
  geom_bar(stat = "summary", fun = "mean", width = 0.3) +
  labs(x = " ")
#shouts
g12 = ggplot(highnode, aes(x = adopter, y = shouts)) +
  geom_bar(stat = "summary", fun = "mean", width = 0.3) +
  labs(x = " ")
#tenure
g13 = ggplot(highnode, aes(x = adopter, y = tenure)) +
  geom_bar(stat = "summary", fun = "mean", width = 0.3) +
  labs(x = " ")
#plots
grid.arrange(g8, g9, g10, g11, g12, g13, nrow = 1, top = "User engagement", bottom = "Source: HighNote")
#create a variable to represent treatment
highnode = mutate(highnode, treat = ifelse(subscriber_friend_cnt>=1,1,0))
#test the difference before PSM
t.test((as.integer(highnode$adopter)-1)~highnode$treat)
##
## Welch Two Sample t-test
##
## data: (as.integer(highnode$adopter) - 1) by highnode$treat
## t = -31, df = 11800, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.13303 -0.11719
## sample estimates:
## mean in group 0 mean in group 1
## 0.052435 0.177543
#logistic regression
m_ps <- glm(treat ~ age + male + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt
+ songsListened + lovedTracks + posts + playlists + shouts + tenure + good_country,
family = binomial(), data = highnode)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m_ps)
##
## Call:
## glm(formula = treat ~ age + male + friend_cnt + avg_friend_age +
## avg_friend_male + friend_country_cnt + songsListened + lovedTracks +
## posts + playlists + shouts + tenure + good_country, family = binomial(),
## data = highnode)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.421 -0.567 -0.422 -0.300 2.562
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.143967704 0.077025853 -66.78 < 2e-16 ***
## age 0.019697667 0.002807995 7.01 2.3e-12 ***
## malemale 0.043114235 0.029980958 1.44 0.15042
## friend_cnt 0.031321849 0.001033703 30.30 < 2e-16 ***
## avg_friend_age 0.079550730 0.003481500 22.85 < 2e-16 ***
## avg_friend_male 0.251388142 0.050285006 5.00 5.8e-07 ***
## friend_country_cnt 0.111034819 0.004764945 23.30 < 2e-16 ***
## songsListened 0.000006906 0.000000516 13.40 < 2e-16 ***
## lovedTracks 0.000667065 0.000056452 11.82 < 2e-16 ***
## posts 0.000569908 0.000268232 2.12 0.03361 *
## playlists 0.005638941 0.011897559 0.47 0.63553
## shouts -0.000049085 0.000037068 -1.32 0.18543
## tenure -0.002571113 0.000776896 -3.31 0.00093 ***
## good_country 0.032012002 0.029217524 1.10 0.27323
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 46640 on 43826 degrees of freedom
## Residual deviance: 34170 on 43813 degrees of freedom
## AIC: 34198
##
## Number of Fisher Scoring iterations: 7
#predict propensity scores
prs_df <- data.frame(pr_score = predict(m_ps, type = "response"),
                     treat = m_ps$model$treat)
head(prs_df)
#Examining the region of common support
labs <- paste("Actual treatment type:", c("having subscriber friends", "no subscriber friends"))
prs_df %>%
  mutate(treat = ifelse(treat == 1, labs[1], labs[2])) %>%
  ggplot(aes(x = pr_score)) +
  geom_histogram(color = "white") +
  facet_wrap(~treat) +
  xlab("Probability of having subscriber friends") +
  theme_bw()
#There are no missing values
summary(is.na(highnode))
## ID age male friend_cnt
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:43827 FALSE:43827 FALSE:43827 FALSE:43827
## NA's :0 NA's :0 NA's :0 NA's :0
## avg_friend_age avg_friend_male friend_country_cnt subscriber_friend_cnt
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:43827 FALSE:43827 FALSE:43827 FALSE:43827
## NA's :0 NA's :0 NA's :0 NA's :0
## songsListened lovedTracks posts playlists
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:43827 FALSE:43827 FALSE:43827 FALSE:43827
## NA's :0 NA's :0 NA's :0 NA's :0
## shouts adopter tenure good_country
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:43827 FALSE:43827 FALSE:43827 FALSE:43827
## NA's :0 NA's :0 NA's :0 NA's :0
## treat
## Mode :logical
## FALSE:43827
## NA's :0
#perform the "nearest neighbor" matching algorithm
library(MatchIt)
mod_match <- matchit(treat ~ age + male + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt
                     + songsListened + lovedTracks + posts + playlists + shouts + tenure + good_country,
                     method = "nearest", data = highnode)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#summary(mod_match)
#plot(mod_match)
#create a dataframe containing only the matched observations
highnode_match <- match.data(mod_match)
dim(highnode_match)
## [1] 19646 19
Visual inspection and the mean differences show that the matching result is good but not perfect: 8 of the 13 covariates are matched reasonably well (age, male, avg_friend_age, avg_friend_male, songsListened, playlists, tenure, and good_country), while the remaining 5 (friend_cnt, friend_country_cnt, lovedTracks, posts, and shouts) are matched less well.
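A numeric complement to the visual inspection is the standardized mean difference (SMD), which `summary(mod_match)` also reports; |SMD| below 0.1 is a common threshold for "well matched". A minimal self-contained sketch of the computation (the `smd` helper and the example values are illustrative, not part of MatchIt):

```r
# Standardized mean difference between treated and control groups:
# difference in means divided by the pooled standard deviation.
smd <- function(x_treat, x_control) {
  pooled_sd <- sqrt((var(x_treat) + var(x_control)) / 2)
  (mean(x_treat) - mean(x_control)) / pooled_sd
}

# Hypothetical example: both groups have sd = 1, means differ by 1 -> SMD = 1
example_smd <- smd(c(25, 26, 27), c(24, 25, 26))
print(example_smd)
```

Applied to each covariate of `highnode_match` split by `treat`, this quantifies which of the 13 covariates remain imbalanced after matching.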
highnode_match$treat = as.factor(highnode_match$treat)
#Visual inspection
#create a plotting function
fn_bal <- function(dta, variable) {
  dta$variable <- dta[, variable]
  support <- c(min(dta$variable), max(dta$variable))
  ggplot(dta, aes(x = distance, y = variable, color = treat)) +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = FALSE) +
    xlab("Propensity score") +
    ylab(variable) +
    theme_bw() +
    ylim(support)
}
#plot
library(gridExtra)
highnode_match$male = as.integer(highnode_match$male)
grid.arrange(
fn_bal(highnode_match, "age") + theme(legend.position = "none"),
fn_bal(highnode_match, "male"),
top = "Visual inspection",
nrow = 1
)
grid.arrange(
fn_bal(highnode_match, "friend_cnt") + theme(legend.position = "none"),
fn_bal(highnode_match, "avg_friend_age"),
top = "Visual inspection",
nrow = 1
)
grid.arrange(
fn_bal(highnode_match, "avg_friend_male") + theme(legend.position = "none"),
fn_bal(highnode_match, "friend_country_cnt"),
top = "Visual inspection",
nrow = 1
)
grid.arrange(
fn_bal(highnode_match, "songsListened") + theme(legend.position = "none"),
fn_bal(highnode_match, "lovedTracks"),
top = "Visual inspection",
nrow = 1
)
grid.arrange(
fn_bal(highnode_match, "posts") + theme(legend.position = "none"),
fn_bal(highnode_match, "playlists"),
top = "Visual inspection",
nrow = 1
)
## Warning: Removed 4 rows containing missing values (geom_smooth).
grid.arrange(
fn_bal(highnode_match, "shouts") + theme(legend.position = "none"),
fn_bal(highnode_match, "tenure"),
top = "Visual inspection",
nrow = 1
)
grid.arrange(
fn_bal(highnode_match, "good_country") + theme(legend.position = "none"),
top = "Visual inspection",
nrow = 1
)
#Difference-in-means
highnode_name = names(highnode_match)[c(-1, -8, -14, -17, -18, -19)]
#summary
highnode_match %>%
  group_by(treat) %>%
  select(one_of(highnode_name)) %>%
  summarise_all(mean)
#t test
for (i in 1:13) {
  print(highnode_name[i])
  print(t.test(highnode_match[, highnode_name[i]] ~ highnode_match$treat))
}
## [1] "age"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 9.02, df = 19300, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.75075 1.16761
## sample estimates:
## mean in group 0 mean in group 1
## 26.332 25.373
##
## [1] "male"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 3.14, df = 19600, p-value = 0.0017
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0080145 0.0347423
## sample estimates:
## mean in group 0 mean in group 1
## 1.6576 1.6363
##
## [1] "friend_cnt"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -24.8, df = 10500, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -35.127 -29.982
## sample estimates:
## mean in group 0 mean in group 1
## 21.467 54.021
##
## [1] "avg_friend_age"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 13.6, df = 18400, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.99898 1.33462
## sample estimates:
## mean in group 0 mean in group 1
## 26.557 25.390
##
## [1] "avg_friend_male"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 5.47, df = 19300, p-value = 4.5e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.012411 0.026264
## sample estimates:
## mean in group 0 mean in group 1
## 0.65515 0.63581
##
## [1] "friend_country_cnt"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -38.6, df = 13900, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.5125 -4.0759
## sample estimates:
## mean in group 0 mean in group 1
## 5.0914 9.3856
##
## [1] "songsListened"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -11.4, df = 18500, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7472.4 -5277.1
## sample estimates:
## mean in group 0 mean in group 1
## 27361 33736
##
## [1] "lovedTracks"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -15.5, df = 16100, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -102.314 -79.327
## sample estimates:
## mean in group 0 mean in group 1
## 134.54 225.36
##
## [1] "posts"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -5.68, df = 11000, p-value = 1.4e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.1640 -9.3273
## sample estimates:
## mean in group 0 mean in group 1
## 6.2773 20.5230
##
## [1] "playlists"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -2.95, df = 17800, p-value = 0.0032
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.119413 -0.024128
## sample estimates:
## mean in group 0 mean in group 1
## 0.67230 0.74407
##
## [1] "shouts"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -8.51, df = 10500, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -79.465 -49.702
## sample estimates:
## mean in group 0 mean in group 1
## 37.236 101.820
##
## [1] "tenure"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 4.16, df = 19600, p-value = 0.000033
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.61022 1.70007
## sample estimates:
## mean in group 0 mean in group 1
## 47.704 46.549
##
## [1] "good_country"
##
## Welch Two Sample t-test
##
## data: highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 2.18, df = 19600, p-value = 0.029
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0015177 0.0282084
## sample estimates:
## mean in group 0 mean in group 1
## 0.35814 0.34328
The t-test on the matched data shows a significant mean difference: about 17.8% of users in the treatment group are adopters, significantly more than in the control group (about 8.6%, itself higher than the pre-matching figure). The result is robust in regressions both with and without covariates: the coefficient on treat is significant in both models.
After controlling for all the covariates, the treatment effect remains significant. It is therefore reasonable to claim a causal relationship between having subscriber friends and the likelihood of becoming an adopter: having subscriber friends increases that likelihood.
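The magnitude of the treatment effect can be read off the logit coefficients as an odds ratio. A short sketch using the treat coefficients reported in the two matched-sample regressions of this section (values hand-copied from the summaries, so approximate):

```r
# Odds ratios implied by the treat coefficients in the matched-sample logits.
or_no_covariates   <- exp(0.8198)   # adopter ~ treat
or_with_covariates <- exp(0.7293)   # adopter ~ treat + covariates

print(round(c(or_no_covariates, or_with_covariates), 2))
# Having at least one subscriber friend roughly doubles the odds of adopting.
```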
#t test
with(highnode_match, t.test( (as.integer(adopter)-1) ~ treat) )
##
## Welch Two Sample t-test
##
## data: (as.integer(adopter) - 1) by treat
## t = -18.9, df = 18100, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.100094 -0.081317
## sample estimates:
## mean in group 0 mean in group 1
## 0.086837 0.177543
#regression without covariates
fit_test1 = glm(adopter~treat, data = highnode_match, family = binomial())
summary(fit_test1)
##
## Call:
## glm(formula = adopter ~ treat, family = binomial(), data = highnode_match)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.625 -0.625 -0.426 -0.426 2.211
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3529 0.0358 -65.7 <2e-16 ***
## treat1 0.8198 0.0445 18.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15345 on 19645 degrees of freedom
## Residual deviance: 14986 on 19644 degrees of freedom
## AIC: 14990
##
## Number of Fisher Scoring iterations: 5
#regression with covariates
fit_test2 = glm(adopter~treat + age + male + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt
+ songsListened + lovedTracks + posts + playlists + shouts + tenure + good_country,
data = highnode_match, family = binomial())
summary(fit_test2)
##
## Call:
## glm(formula = adopter ~ treat + age + male + friend_cnt + avg_friend_age +
## avg_friend_male + friend_country_cnt + songsListened + lovedTracks +
## posts + playlists + shouts + tenure + good_country, family = binomial(),
## data = highnode_match)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.224 -0.567 -0.456 -0.370 2.526
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.67493445 0.14592868 -25.18 < 2e-16 ***
## treat1 0.72927538 0.04681530 15.58 < 2e-16 ***
## age 0.01419457 0.00406911 3.49 0.00049 ***
## male 0.30396597 0.04899092 6.20 5.5e-10 ***
## friend_cnt -0.00020016 0.00027927 -0.72 0.47355
## avg_friend_age 0.01304253 0.00534905 2.44 0.01476 *
## avg_friend_male 0.06006885 0.09252467 0.65 0.51620
## friend_country_cnt 0.00727747 0.00364861 1.99 0.04609 *
## songsListened 0.00000425 0.00000053 8.03 1.0e-15 ***
## lovedTracks 0.00052108 0.00004692 11.11 < 2e-16 ***
## posts 0.00011861 0.00008891 1.33 0.18221
## playlists 0.04464860 0.01194440 3.74 0.00019 ***
## shouts 0.00011195 0.00007454 1.50 0.13316
## tenure -0.00243354 0.00121709 -2.00 0.04556 *
## good_country -0.36949176 0.04801809 -7.69 1.4e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15345 on 19645 degrees of freedom
## Residual deviance: 14493 on 19631 degrees of freedom
## AIC: 14523
##
## Number of Fisher Scoring iterations: 5
All variables are included in the regression model. The summary shows that avg_friend_male, posts, and shouts are not significant.
#data preprocessing
highnode = read.csv("HighNote Data Midterm.csv")
categorical_col_name = names(highnode)[c(3, 14, 16)]
for (i in 1:3) {
  highnode[, categorical_col_name[i]] = as.factor(highnode[, categorical_col_name[i]])
}
#put all variables into the regression
fit1 = glm(adopter ~ age + male + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt +
subscriber_friend_cnt + songsListened + lovedTracks + posts + playlists + shouts + tenure +
good_country,
data = highnode, family = binomial())
summary(fit1)
##
## Call:
## glm(formula = adopter ~ age + male + friend_cnt + avg_friend_age +
## avg_friend_male + friend_country_cnt + subscriber_friend_cnt +
## songsListened + lovedTracks + posts + playlists + shouts +
## tenure + good_country, family = binomial(), data = highnode)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.353 -0.411 -0.350 -0.291 2.702
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.179243363 0.095711116 -43.67 < 2e-16 ***
## age 0.019617836 0.003477727 5.64 1.7e-08 ***
## male1 0.413269630 0.041690899 9.91 < 2e-16 ***
## friend_cnt -0.004312141 0.000491970 -8.77 < 2e-16 ***
## avg_friend_age 0.029539617 0.004483627 6.59 4.4e-11 ***
## avg_friend_male 0.116206131 0.063462474 1.83 0.067 .
## friend_country_cnt 0.043258567 0.003616344 11.96 < 2e-16 ***
## subscriber_friend_cnt 0.091319891 0.010728991 8.51 < 2e-16 ***
## songsListened 0.000007626 0.000000519 14.69 < 2e-16 ***
## lovedTracks 0.000695027 0.000049334 14.09 < 2e-16 ***
## posts 0.000084922 0.000095800 0.89 0.375
## playlists 0.059201112 0.013331442 4.44 9.0e-06 ***
## shouts 0.000110769 0.000084280 1.31 0.189
## tenure -0.004476042 0.001021990 -4.38 1.2e-05 ***
## good_country1 -0.415225836 0.040783753 -10.18 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24537 on 43826 degrees of freedom
## Residual deviance: 22613 on 43812 degrees of freedom
## AIC: 22643
##
## Number of Fisher Scoring iterations: 5
After deleting the non-significant variables, the chi-squared test shows no significant difference between the full model and the simpler model, so the simpler model is preferred for its simplicity. In addition, there is no multicollinearity problem, since the VIFs of all variables are below 5, the common rule of thumb.
#delete non-significant variables
fit2 = glm(adopter ~ age + male + friend_cnt + avg_friend_age + friend_country_cnt + avg_friend_male +
subscriber_friend_cnt + songsListened + lovedTracks + playlists + tenure + good_country,
data = highnode, family = binomial())
summary(fit2)
##
## Call:
## glm(formula = adopter ~ age + male + friend_cnt + avg_friend_age +
## friend_country_cnt + avg_friend_male + subscriber_friend_cnt +
## songsListened + lovedTracks + playlists + tenure + good_country,
## family = binomial(), data = highnode)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.390 -0.411 -0.350 -0.291 2.702
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.178014173 0.095742066 -43.64 < 2e-16 ***
## age 0.019452737 0.003479446 5.59 2.3e-08 ***
## male1 0.410787361 0.041653667 9.86 < 2e-16 ***
## friend_cnt -0.004245446 0.000488720 -8.69 < 2e-16 ***
## avg_friend_age 0.029512585 0.004487202 6.58 4.8e-11 ***
## friend_country_cnt 0.044063934 0.003614547 12.19 < 2e-16 ***
## avg_friend_male 0.115931086 0.063478277 1.83 0.068 .
## subscriber_friend_cnt 0.091497113 0.010734994 8.52 < 2e-16 ***
## songsListened 0.000007738 0.000000515 15.02 < 2e-16 ***
## lovedTracks 0.000699438 0.000049270 14.20 < 2e-16 ***
## playlists 0.058975070 0.013314780 4.43 9.5e-06 ***
## tenure -0.004417671 0.001021436 -4.32 1.5e-05 ***
## good_country1 -0.416004121 0.040777587 -10.20 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24537 on 43826 degrees of freedom
## Residual deviance: 22618 on 43814 degrees of freedom
## AIC: 22644
##
## Number of Fisher Scoring iterations: 5
#Model comparison -> fit2 is better for simplicity
anova(fit1, fit2, test="Chisq")
#Multicollinearity -> all the variables are smaller than 5, which is the rule of thumb
library(car)
vif(fit2)
## age male friend_cnt
## 2.0277 1.0609 4.1079
## avg_friend_age friend_country_cnt avg_friend_male
## 2.0624 2.6171 1.0421
## subscriber_friend_cnt songsListened lovedTracks
## 2.9224 1.2657 1.1482
## playlists tenure good_country
## 1.0441 1.2129 1.0295
A rule of thumb for McFadden's pseudo R-squared is that values from 0.2 to 0.4 indicate very good model fit. The McFadden's pseudo R-squared of this model is 0.0782, which falls short of that range.
The AUC of the ROC curve is 0.739, which is good.
The overall accuracy is 0.649. Cohen's kappa of the confusion matrix is 0.128, meaning the model agrees with the true labels about 12.8% more than chance agreement alone would produce.
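Cohen's kappa can be reproduced by hand from the confusion-matrix counts reported later in this section (prediction rows by reference columns); a sketch:

```r
# Sketch: Cohen's kappa from the confusion-matrix counts in this report.
cm <- matrix(c(25935,  1037,
               14365,  2490), nrow = 2, byrow = TRUE)
n  <- sum(cm)

observed_agreement <- sum(diag(cm)) / n
# Chance agreement from the row and column marginals
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (observed_agreement - expected_agreement) / (1 - expected_agreement)
print(round(kappa, 3))
```

This matches the Kappa value caret reports in the confusionMatrix output below.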
#Use McFadden's pseudo R-squared. The measure ranges from 0 to just under 1, with values closer to zero indicating that the model has little predictive power.
#Details can be found at: https://www.r-bloggers.com/evaluating-logistic-regression-models/
library(pscl)
pR2(fit2)
## llh llhNull G2 McFadden r2ML
## -11309.019376 -12268.456570 1918.874387 0.078204 0.042838
## r2CU
## 0.099924
#plot ROC curve
library(pROC)
g = roc(highnode$adopter ~ predict(fit2,type = "response"))
plot(g,print.thres = "best",print.auc=TRUE)
#find the best threshold
best_thres = coords(g, "best", ret=c("threshold", "specificity", "sensitivity"))
print(best_thres)
## threshold specificity sensitivity
## 0.072675 0.643548 0.705982
#confusion matrix and Kappa (predictions coerced to a factor to match the reference)
library(caret)
as.factor(as.numeric(predict(fit2, newdata = highnode, type = "response") > best_thres[1])) %>%
  confusionMatrix(reference = highnode$adopter)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25935 1037
## 1 14365 2490
##
## Accuracy : 0.649
## 95% CI : (0.644, 0.653)
## No Information Rate : 0.92
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.128
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.644
## Specificity : 0.706
## Pos Pred Value : 0.962
## Neg Pred Value : 0.148
## Prevalence : 0.920
## Detection Rate : 0.592
## Detection Prevalence : 0.615
## Balanced Accuracy : 0.675
##
## 'Positive' Class : 0
##
#library(measures)
#KAPPA(highnode$adopter, #truth
# as.numeric(predict(fit2, newdata = highnode, type = "response") > best_thres[1])) #predicted values
Generally, the following variables have a positive effect on the likelihood of becoming an adopter: age, male, avg_friend_age, avg_friend_male, friend_country_cnt, subscriber_friend_cnt, songsListened, lovedTracks, and playlists.
The following variables have a negative effect on the likelihood of becoming an adopter: friend_cnt, tenure, and good_country.
However, friend_cnt, songsListened, lovedTracks, and tenure may not be economically significant, given their small coefficients. For example, if a user listens to one more song, the odds of becoming an adopter (p/(1-p)) increase by a factor of only about 1.000008.
Three kinds of variables deserve the most attention because of their large coefficients: gender (male and avg_friend_male), subscriber_friend_cnt, and good_country.
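Whether a small per-unit coefficient matters economically depends on a realistic change in the variable. A sketch rescaling the songsListened coefficient from fit2 (value hand-copied from the summary above):

```r
# Per-unit vs realistically scaled odds ratios for songsListened.
beta_songs <- 0.000007738

print(exp(beta_songs))          # one more song: ~1.000008, negligible
print(exp(beta_songs * 10000))  # 10,000 more songs: roughly an 8% increase in the odds
```

So a variable with a tiny coefficient can still be meaningful over the ranges actually observed in the data.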
#odds ratio
exp(coef(fit2))
## (Intercept) age male1
## 0.015329 1.019643 1.508005
## friend_cnt avg_friend_age friend_country_cnt
## 0.995764 1.029952 1.045049
## avg_friend_male subscriber_friend_cnt songsListened
## 1.122918 1.095814 1.000008
## lovedTracks playlists tenure
## 1.000700 1.060749 0.995592
## good_country1
## 0.659678
The descriptive statistics show that a typical adopter is an older male with many male friends. Meanwhile, the regression analysis shows that older age and male gender (both male and avg_friend_male) increase the probability of becoming an adopter. For HighNote, the "free to fee" strategy should therefore focus on older males, since they are the target users. The company could show more ads to this group and develop more features that satisfy their needs, to encourage them to pay.
The data visualization shows that the adopter group is higher than the non-adopter group in every aspect of user engagement, suggesting that the more a user engages with the site, the more likely he/she is to become an adopter. Generally, the company should encourage users to interact with the site more frequently, for example by recommending more songs.
The company could also consider extending its business globally, because the regression analysis shows that users who are not from the US, UK, or Germany are more willing to pay. For example, if HighNote attracted more Asian users, the overall subscription rate would tend to increase.